Frequency distributions

By Evgenia "Jenny" Nitishinskaya and Delaney Granizo-Mackenzie

Notebook released under the Creative Commons Attribution 4.0 License.


Data is sometimes more useful when presented in a summary. Although the complete list of observations contains more information, a summary that makes the data more comprehensible can be more useful in practice. For instance, we often cite the mean of a data set instead of listing out all of its values. This is particularly useful when we have a very large number of data points.

A frequency distribution consists of:

  • a set of intervals, and
  • the number of observations that fall into each.

Talking about data in intervals often makes more sense than talking about specific values. For instance, if all 50 observations of a random variable have landed in the interval [99, 101], we expect the next observation to be approximately 100. We do not expect it to take any particular value, however; for example, we would not bet that it will be exactly equal to 100.72.

When constructing a frequency distribution, we use some number $k$ of equally sized bins. Each of the $k$ intervals has length $(\text{range of data})/k$.
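
To make this concrete, here is a minimal sketch (using a small hypothetical data array, not the series generated below) that builds the $k$ equally sized intervals by hand and checks the resulting counts against NumPy's np.histogram:


In [ ]:
import numpy as np

# Hypothetical small sample and a chosen number of bins
data = np.array([2.1, 3.5, 3.7, 4.0, 4.2, 5.9, 6.3, 7.8])
k = 4

# Each of the k intervals has length (range of data)/k
bin_length = (data.max() - data.min()) / k
edges = np.linspace(data.min(), data.max(), k + 1)

# Count the observations falling in each interval [edges[i], edges[i+1]);
# the last interval also includes its right endpoint, as np.histogram does
counts = np.array(
    [np.sum((data >= edges[i]) & (data < edges[i + 1])) for i in range(k - 1)]
    + [np.sum((data >= edges[-2]) & (data <= edges[-1]))]
)

print('Bin length:', bin_length)
print('Bin edges:', edges)
print('Manual counts:', counts)
print('np.histogram counts:', np.histogram(data, k)[0])
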

The usefulness of frequency distributions can be seen below. Binning the data makes it much easier to see that the observations were sampled from a normal distribution than a listing that marks each data point individually.


In [8]:
import numpy as np
import matplotlib.pyplot as plt

# Draw points randomly from a normal distribution
X = np.random.randn(150)
print(np.sort(X))


[-3.1225869  -2.77091232 -2.35565128 -1.86813543 -1.85346154 -1.68944993
 -1.68258591 -1.64557639 -1.54274686 -1.52442513 -1.41993354 -1.41820754
 -1.36838564 -1.36059377 -1.33431062 -1.32214406 -1.29574032 -1.28072041
 -1.27754403 -1.26119268 -1.20129466 -1.19297187 -1.1669786  -1.11906212
 -1.11283514 -1.09256772 -1.08324074 -0.94184954 -0.93310059 -0.90716823
 -0.90177208 -0.89388499 -0.85105748 -0.81092061 -0.801305   -0.78978186
 -0.78909464 -0.76623016 -0.7551656  -0.7355297  -0.70993823 -0.70015282
 -0.69725693 -0.6492812  -0.63886892 -0.62033627 -0.58890859 -0.47484525
 -0.44045654 -0.43101192 -0.39536643 -0.38445301 -0.37480974 -0.32837588
 -0.30751763 -0.30241748 -0.30117662 -0.28584429 -0.27251828 -0.2514343
 -0.25142267 -0.21467689 -0.21036819 -0.19248571 -0.18729025 -0.17662071
 -0.14624252 -0.13918587 -0.13901212 -0.12559515 -0.10318524 -0.08320543
 -0.06992525 -0.06663636 -0.05008901 -0.0443569  -0.03427594 -0.02835919
 -0.0090919  -0.00853526  0.02122566  0.03765736  0.05419672  0.05473684
  0.09288985  0.10174429  0.10617367  0.10810799  0.14555989  0.16015038
  0.21648776  0.22268832  0.25508164  0.26894377  0.26985161  0.27933158
  0.30482791  0.32336195  0.32616571  0.34922347  0.35161211  0.36692102
  0.38622558  0.39212185  0.39565638  0.39594965  0.40861241  0.43317204
  0.44375096  0.45009674  0.51291147  0.54576957  0.57318852  0.58096726
  0.60295094  0.64976253  0.65493752  0.65899035  0.72413065  0.73114397
  0.73114645  0.73937843  0.76727747  0.88836728  0.9343755   0.98653847
  1.00438895  1.01001819  1.03245763  1.16677684  1.22111905  1.2336709
  1.25009675  1.26019452  1.28547297  1.30474286  1.38694693  1.40411824
  1.43078989  1.57303083  1.62508108  1.67294028  1.81575895  1.87347051
  1.90907199  2.11354726  2.1395832   2.16815783  2.43716148  2.52680317]

It's difficult to tell that the data is approximately normally distributed just by looking at the list above. Below, we group the data into 300 bins; this is more bins than data points, so many bins are empty while others hold several observations, since the data is clustered around the mean.


In [15]:
# Print the list of frequencies of the data points
# The second argument is the number of bins
# The second output is the array of bin dividers; we ignore it here because there are 301 of them
hist, _ = np.histogram(X, 300)
print('Data frequencies:', hist)


Data frequencies: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
 0 0 2 0 1 0 0 0 0 1 1 0 0 0 0 0 2 0 0 2 1 1 0 3 1 0 0 0 2 1 0 0 2 1 1 0 0
 0 0 0 0 1 1 2 1 0 1 0 1 3 0 2 1 0 3 0 0 2 1 0 1 0 0 0 0 0 1 0 2 0 1 2 0 0
 1 3 1 1 2 0 2 2 1 0 3 1 1 1 2 3 1 2 1 1 2 0 1 3 0 1 1 0 0 2 0 1 3 0 2 1 2
 1 4 1 1 2 0 0 0 1 1 0 2 1 0 0 3 0 0 0 3 1 1 0 0 0 0 0 1 0 0 1 0 0 1 2 1 0
 0 0 0 0 0 1 0 0 1 1 2 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0
 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 1]

Frequency distributions are often used in graphing because binned data is visually more comprehensible. For example, the commonly used plotting module matplotlib.pyplot has a built-in histogram function, plt.hist, which takes the number of bins $k$ as an argument.


In [5]:
plt.hist(X, 300);


This is still a little noisy, so we can reduce the number of bins (and increase the bin size) further to get a smoother, more readable histogram.


In [13]:
# Print the frequency of the data in each bin
hist, bins = np.histogram(X, 15)
print('Bin dividers:', bins)
print('Binned data frequencies:', hist)
plt.hist(X, 15);


Bin dividers: [-3.1225869  -2.7459609  -2.36933489 -1.99270889 -1.61608288 -1.23945688
 -0.86283087 -0.48620487 -0.10957887  0.26704714  0.64367314  1.02029915
  1.39692515  1.77355116  2.15017716  2.52680317]
Binned data frequencies: [ 2  0  1  5 12 12 15 23 23 22 13  9  5  5  3]

However, if bin sizes are too large, the data ceases to be very informative:


In [14]:
# Print the frequency of the data in each bin
hist, bins = np.histogram(X, 4)
print('Bin dividers:', bins)
print('Binned data frequencies:', hist)
plt.hist(X, 4);


Bin dividers: [-3.1225869  -1.71023938 -0.29789187  1.11445565  2.52680317]
Binned data frequencies: [ 5 52 72 21]
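
Choosing the number of bins is therefore a judgment call between these two extremes. As a rough sketch (not part of the analysis above, and assuming a NumPy version that accepts string bin-selection rules), we could also let np.histogram pick a bin count automatically:


In [ ]:
# Let NumPy choose the number of bins; the 'auto' rule takes the larger of
# the Sturges and Freedman-Diaconis estimates
hist, bins = np.histogram(X, bins='auto')
print('Number of bins chosen:', len(hist))
print('Binned data frequencies:', hist)
plt.hist(X, bins=bins);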